Background

Our team has chosen the Ames Housing dataset for our Data 606 final project. This is an extremely interesting dataset with many features, which makes it well suited for analysis. This dataset is well suited for regression analysis in addition to utilizing sampling techniques and data imputation methods.

The primary question that we intend to answer with this project is: Can we accurately predict the sale price of a house given variables describing the house? This is of interest to us as understanding what factors affect a home price is useful when purchasing and selling a home, which, most people will do at some time in their life and it is a major purchase. The specific data set is for the city of Ames, Iowa, so in that respect it is not especially interesting as we will probably never purchase or sell a home in Ames, but the techniques and methods we employ during our project will be useful in the future.

Introduction to the dataset

In this report we intend to explore this data set looking for relationships between the target variable and the predictors. There is a large number of initial features (80), with some categorical data as well as numerical data. Below we detail out each of the individual variables that comprise our data set.

Name Description Type
SalePrice The property’s sale price in dollars. This is the target variable that you’re trying to predict. Numeric
MSSubClass Identifies the type of dwelling involved in the sale. Category - Nominal
MSZoning Identifies the general zoning classification of the sale. Category - Nominal
LotFrontage Linear feet of street connected to property Numeric
LotArea Lot size in square feet Numeric
Street Type of road access Category - Nominal
Alley Type of alley access Category - Nominal
LotShape General shape of property Category - Nominal
LandContour Flatness of the property Category - Nominal
Utilities Type of utilities available Category - Nominal
LotConfig Lot configuration Category - Nominal
LandSlope Slope of property Category - Nominal
Neighborhood Physical locations within Ames city limits Category - Nominal
Condition1 Proximity to main road or railroad Category - Nominal
Condition2 Proximity to main road or railroad (if a second is present) Category - Nominal
BldgType Type of dwelling Category - Nominal
HouseStyle Style of dwelling Category - Nominal
OverallQual Overall material and finish quality Category - Ordinal
OverallCond Overall condition rating Category - Ordinal
YearBuilt Original construction date Date
YearRemodAdd Remodel date Date
RoofStyle Type of roof Category - Nominal
RoofMatl Roof material Category - Nominal
Exterior1st Exterior covering on house Category - Nominal
Exterior2nd Exterior covering on house (if more than one material) Category - Nominal
MasVnrType Masonry veneer type Category - Nominal
MasVnrArea Masonry veneer area in square feet Numeric
ExterQual Exterior material quality Category - Nominal
ExterCond Present condition of the material on the exterior Category - Nominal
Foundation Type of foundation Category - Nominal
BsmtQual Height of the basement Category - Nominal
BsmtCond General condition of the basement Category - Nominal
BsmtExposure Walkout or garden level basement walls Category - Nominal
BsmtFinType1 Quality of basement finished area Category - Nominal
BsmtFinSF1 Type 1 finished square feet Numeric
BsmtFinType2 Quality of second finished area (if present) Category - Nominal
BsmtFinSF2 Type 2 finished square feet Numeric
BsmtUnfSF Unfinished square feet of basement area Numeric
TotalBsmtSF Total square feet of basement area Numeric
Heating Type of heating Category - Nominal
HeatingQC Heating quality and condition Category - Nominal
CentralAir Central air conditioning Category - Nominal
Electrical Electrical system Category - Nominal
1stFlrSF First Floor square feet Numeric
2ndFlrSF Second floor square feet Numeric
LowQualFinSF Low quality finished square feet (all floors) Numeric
GrLivArea Above grade (ground) living area square feet Numeric
BsmtFullBath Basement full bathrooms Numeric
BsmtHalfBath Basement half bathrooms Numeric
FullBath Full bathrooms above grade Numeric
HalfBath Half baths above grade Numeric
Bedroom Number of bedrooms above basement level Numeric
Kitchen Number of kitchens Numeric
KitchenQual Kitchen quality Category - Nominal
TotRmsAbvGrd Total rooms above grade (does not include bathrooms) Numeric
Functional Home functionality rating Category - Nominal
Fireplaces Number of fireplaces Numeric
FireplaceQu Fireplace quality Category - Nominal
GarageType Garage location Category - Nominal
GarageYrBlt Year garage was built Date
GarageFinish Interior finish of the garage Category - Nominal
GarageCars Size of garage in car capacity Numeric
GarageArea Size of garage in square feet Numeric
GarageQual Garage quality Category - Nominal
GarageCond Garage condition Category - Nominal
PavedDrive Paved driveway Category - Nominal
WoodDeckSF Wood deck area in square feet Numeric
OpenPorchSF Open porch area in square feet Numeric
EnclosedPorch Enclosed porch area in square feet Numeric
3SsnPorch Three season porch area in square feet Numeric
ScreenPorch Screen porch area in square feet Numeric
PoolArea Pool area in square feet Numeric
PoolQC Pool quality Category - Nominal
Fence Fence quality Category - Nominal
MiscFeature Miscellaneous feature not covered in other categories Category - Nominal
MiscVal $Value of miscellaneous feature Category - Nominal
MoSold Month Sold Category - Nominal
YrSold Year Sold Category - Nominal
SaleType Type of sale Category - Nominal
SaleCondition Condition of sale Category - Nominal

Methodology

Exploratory data analysis

Data cleaning

Replace Missing Values

From the original dataset, the first step of the data cleaning process was to look at the features with missing values. A function was created and a count of the missing values was generated.

## # A tibble: 19 x 2
##    Feature        NAs
##    <chr>        <int>
##  1 LotFrontage    259
##  2 Alley         1369
##  3 MasVnrType       8
##  4 MasVnrArea       8
##  5 BsmtQual        37
##  6 BsmtCond        37
##  7 BsmtExposure    38
##  8 BsmtFinType1    37
##  9 BsmtFinType2    38
## 10 Electrical       1
## 11 FireplaceQu    690
## 12 GarageType      81
## 13 GarageYrBlt     81
## 14 GarageFinish    81
## 15 GarageQual      81
## 16 GarageCond      81
## 17 PoolQC        1453
## 18 Fence         1179
## 19 MiscFeature   1406

From the description of the dataset, we know that for a number of these features, “NA” is actually a true category that we can simply fill in. The following code chunk replaces the “NA” value with a meaningful category and returns the remaining missing values from the cleaned dataset.

house.clean = replace_cats(house.orig)

From here are left with 5 features that we need to explore further. This will be outlined in the “Imputation” section.

Imputation

In order to further remove missing values from the dataset, there are 5 remaining variables that need to be explored further. The MiscFeature variable is missing 96.3% of the values, and upon reviewing the description of this category we have deemed it safe to simply remove this feature.

The following table outlines the counts of each category within the Electrical variable.

table(house.clean$Electrical)
## 
## FuseA FuseF FuseP   Mix SBrkr 
##    94    27     3     1  1334

Based on this table, we see that one of the categories, “SBrkr”, which represents the standard electrical system of a typical house, dominates the other categories. As a result, we decided to choose the dominant category to impute into the one missing value. This is done in the following code chunk.

# Impute the mode of the categorical variable
house.clean = impute_cat_mode(house.clean, "Electrical")

The following table outlines the counts of each category within the masonry veneer type variable, MasVnrTypes.

table(house.clean$MasVnrType)
## 
##  BrkCmn BrkFace    None   Stone 
##      15     445     864     128

From the above table, it is not clear if there is a straightforward way to impute values into the missing values. The following code chunk completes the following steps:

  1. Count the number of missing values in the feature column = n
  2. Select a simple random sample (n) of observations without missing values
  3. Replace the missing values with the simple random sample
house.clean = replace_cats_prop(house.clean, 'MasVnrType')

The last two variables that require imputed values are “MasVnrArea” and “LotFrontage”. This was completed using the mice package.

house.clean = impute_values(house.clean, c("MasVnrArea", 'LotFrontage'))

Remove Outliers

The following section outlines the process behind removing outliers and cleaning up unnecessary columns from the dataset.

We removed 2 data points out of the set. These 2 points are when the above ground living around is above 4000 but the sale price is less than 200000.

Feature Engineering

The following section outlines the steps taken to remove, filter and combine columns in order to create a reduced dataframe with features that are more relevant to the housing sale price.

Above Ground and Below Ground Bathrooms

FullBath: Full bathrooms above grade HalfBath: Half baths above grade

The Above Ground bathrooms variables were combined into one variable thru combining Full Baths and Half Baths. They were further categorized due to the variance of the amount of bathrooms, say if a house had 3+ bathrooms, they were combined with any house that had 2.5+ Above Ground bathrooms. We will show the categorization below. The Below Ground bathrooms where calculated in a very similar manner

Our BathAbove variable starts with: \[ BathAbove = FullBath + 0.5 * HalfBath \]

Our BathBelow variable starts with: \[ BathAbove = BsmtFullBath + 0.5 * BsmtHalfBath \]

house.clean %>% 
  mutate(colorN  = ifelse((FullBath + 0.5 * HalfBath) <= 1, "1",
                     ifelse((FullBath + 0.5 * HalfBath) >= 2.5, '2.5+', as.character(FullBath + 0.5 * HalfBath)))) %>%
ggplot(aes(y = (SalePrice), x = factor(FullBath + 0.5 * HalfBath), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue", "Blue", "Green")) +
     scale_alpha_manual(values=c("1", "0", "2","2.5+")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

house.clean %>% 
  mutate( colorN  = ifelse((BsmtFullBath + 0.5 * BsmtFullBath) <= 1, "1",
          ifelse( (BsmtFullBath + 0.5 * BsmtFullBath) >= 2.5, "2.5+", as.character( BsmtFullBath + 0.5 * BsmtFullBath )))) %>%
  ggplot(aes(y = (SalePrice), x = factor(BsmtFullBath + 0.5 * BsmtFullBath), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue", "Green", "Green")) +
     scale_alpha_manual(values=c("0", "1", "2","2.5+")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

WoodDeck Square Footage (not complete)

ggplot(house.clean, aes(y = SalePrice, x = WoodDeckSF, color = WoodDeckSF)) +
  geom_point() +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2 )

house.clean %>% 
  filter(WoodDeckSF > 0) %>%
  ggplot(aes(y = SalePrice, x = WoodDeckSF, color = WoodDeckSF)) +
  geom_point() +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
  geom_smooth(method='lm', formula= y~x)

Type of Sale

The type of sale variable, “SaleType”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.

house.clean %>%
  ggplot(aes(x = SaleType, y = SalePrice)) +
  scale_y_continuous(labels = scales::dollar) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2)

Due to the limited variation between categories, this variable was reduced into three categories: New, Deed and Other. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>%
  mutate(SaleType = ifelse(SaleType == 'New',
                     "New",
                     ifelse(SaleType %in% c("WD", "CWD", "VWD", "COD"),
                            "Deed",
                            "Other"))) %>%
  ggplot(aes(x = SaleType, y = SalePrice)) +
  scale_y_continuous(labels = scales::dollar) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2)

Sale Condition

The sale condition variable, “SaleCondition”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.

house.clean %>%
  ggplot(aes(x = SaleCondition, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

Due to the limited variation between categories, this variable was reduced into two categories: Good and Poor. The original six categories did not have a suitable amount of variation and as a result, five of these categories were combined into the “Poor” category. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>%
  mutate(SaleCondition = ifelse(SaleCondition %in% c("Partial"),"Good", "Poor")) %>%
  ggplot(aes(x = SaleCondition, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

Overall Condition

The overall condition variable, a categorical variable with values ranging from 1 to 9, has limited variation between categories as outlined in the following chart.

house.clean %>%
  ggplot(aes(y = SalePrice, x = factor(OverallCond))) + 
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

this variable was reduced into two categories: Good and Bad. The reduced category, “Good”, now represents overall condition with values greater than and equal to 5 while “Bad” represents values less than 5. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>%
  mutate(OverallCond = ifelse(OverallCond >= 5, "Good", "Bad")) %>%
  ggplot(aes(y = SalePrice, x = OverallCond)) + 
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

Type of Foundation

The type of foundation variable, “Foundation”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.

house.clean %>%
  ggplot(aes(x = Foundation, y = SalePrice)) +
  stat_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).

Due to the limited variation between categories, this variable was reduced into two categories: “PConc” and “Other”. The “PConc” category, which represents houses with poured concrete foundations, represented the majority of the data in this dataset and its distribution was different than all other categories, hence the reduction into two categories. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>% 
  mutate(colorN = ifelse(Foundation == "PConc", Foundation, "Other")) %>%
  ggplot(aes(y = (SalePrice), x = factor(Foundation), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue" )) +
  scale_alpha_manual(values=c("PConc", "Other" )) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

The General Zoning Classification

The general zoning classification variable, “MSZoning”, has limited variation between categories and the majority of the data falls in a couple of categories, as outlined in the following chart.

house.clean %>%
  ggplot(aes(x = MSZoning, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).

Due to the limited variation between categories, this variable was reduced into two categories: “RL” and “Other”. The “RL” category, which represents houses in low density residential zones, represented the majority of the data in this dataset and its distribution was different than all other categories, hence the reduction into two categories. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>%
  mutate(MSZoning = ifelse(MSZoning == "RL", MSZoning, "Other")) %>%
  ggplot(aes(x = MSZoning, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar, limits = c(0, 300000))
## Warning: Removed 115 rows containing non-finite values (stat_boxplot).
## Warning: Removed 115 rows containing missing values (geom_point).

Total Rooms Above Grade

The total rooms above grade variable, “TotRmsAbvGrd”, a categorical variable with values ranging up to 14, has limited variation between categories as outlined in the following chart.

house.clean %>% 
  mutate(colorN = ifelse(TotRmsAbvGrd >= 7, "High", "Low")) %>%
  ggplot(aes(y = (SalePrice), x = factor(TotRmsAbvGrd), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("red", "blue")) +
  scale_alpha_manual(values=c("High", "Low")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
  xlab("Total Rooms Above Grade") +
  ylab("Sale Price") +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.title = element_blank())

Due to the limited variation between categories, was reduced into two categories: “High” and “Low”. The reduced category, “Good”, now represents overall condition with values greater than and equal to 5 while “Bad” represents values less than 5. The following chart outlines a reasonable amount of variation between these new, combined categories.

house.clean %>% 
  mutate(colorN  = ifelse(Fireplaces > 0, TRUE, FALSE)) %>%
  ggplot(aes(y = (SalePrice), x = factor(Fireplaces), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue")) +
  scale_alpha_manual(values=c(FALSE, TRUE)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
  xlab("Total Rooms Above Grade") +
  ylab("Sale Price") +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.title = element_blank())

Fireplace

Fireplaces: Number of fireplaces

house.clean %>% 
  mutate( colorN  = ifelse(Fireplaces > 0, TRUE, FALSE)) %>%
  ggplot(aes(y = (SalePrice), x = factor(Fireplaces), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue")) +
  scale_alpha_manual(values=c(FALSE, TRUE)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

The Fireplace variable was shortened simply to houses that had or did not have a fireplace. There wasn’t enough variance between houses with more than 1 fireplace, nor are their considerably more data points for houses that have more than 1 fireplace.

GarageCars

house.clean %>% 
  mutate( colorN  = ifelse(GarageCars < 3, GarageCars, 3)) %>%
  ggplot(aes(y = log(SalePrice), x = factor(GarageCars), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue", "Green", "Yellow" )) +
  scale_alpha_manual(values=c("0", "1", "2", "3")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

The Garage Cars variables was re-categorized into an ordinal category variable. Because there were not many 4 Garage Car homes, they are combined into one category of 3+ Garage homes.

SaleType

house.clean %>% 
  mutate(colorN = ifelse(SaleType == 'New',"New",
                         ifelse(SaleType %in% c("WD", "CWD", "VWD", "COD"), "Deed", "Other"))) %>%
  ggplot(aes(y = log(SalePrice),
             x = reorder(factor(SaleType), log(SalePrice), FUN = median),
             fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue", "Green" )) +
  scale_alpha_manual(values=c("New", "Deed", "Other" )) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

When considering SaleType we wanted to boxplot the values to see the variance between each category, however when considering the meaning and context of each of the categories, we grouped based on their contextual meaning of each category, ultimately re-categorizing the variables as a ordinal variable, a New Sale, Deeds sale, and other.

BasementCond (kept and categorized)

When considering Basement Condition, our boxplot chart explains that there is no variance relative to Sale Price, between a Basement Condition that is Good and Typical, between Fair and No Basement, and a Poor Basement Condition. As such, they are re-categorized to a Good, Average and Bad categorization.

house.clean %>% 
  mutate(colorN = ifelse(BsmtCond %in% c("Gd", "TA"), "Good",
                         ifelse(BsmtCond %in% c("Fa", NA), "Ave", "Bad"))) %>%
  ggplot(aes(y = (SalePrice), x = factor(BsmtCond), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("red", "black", "blue")) +
  scale_alpha_manual(values=c("Good", "Ave", "Bad")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

OverallCond

house.clean %>% 
  mutate(colorN = ifelse(OverallCond >= 5, "Good", "Bad")) %>%
  ggplot(aes(y = (SalePrice), x = factor(OverallCond), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue" )) +
  scale_alpha_manual(values=c("Good", "Bad" )) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

Heating (kept and categorized)

house.clean %>% 
  mutate(colorN = ifelse(Heating %in% c("GasA", "GasW"), "Gas", "NoGas")) %>%
  ggplot(aes(y = (SalePrice), x = factor(Heating), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue" )) +
  scale_alpha_manual(values=c("Gas", "NoGas" )) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

BsmtFinType1 (kept and categorized)

house.clean %>% 
  mutate(colorN = ifelse(BsmtFinType1 %in% c("ALQ", "BLQ", "LwQ", "Rec", "Unf"), "Ave",
                         ifelse(BsmtFinType1 == "GLQ", "Good", "None"))) %>%  
  ggplot(aes(y = (SalePrice), x = factor(BsmtFinType1), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("red", "blue", "black")) +
  scale_alpha_manual(values=c("Good", "Ave", "Bad")) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

OverallQual

house.clean %>% 
  mutate(colorN = ifelse(OverallQual >= 9, "9+",
                         ifelse(OverallQual <= 2, "0-",
                                ifelse(OverallQual > 2 & OverallQual <= 4, "3-4", OverallQual)))) %>%
  ggplot(aes(y = (SalePrice), x = factor(OverallQual), fill=factor(colorN) , color = factor(colorN))) +
  geom_boxplot() +
  scale_fill_manual(values=c("Red", "Blue","Green", "Green", "Green", "Green", "Yellow" )) +
  scale_alpha_manual(values=c("5","6","7", "8", "3-4", "9+", "0-"  )) +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2)

#Scheffe Test
CRD.OverallQual = aov(log(SalePrice) ~ OverallQual, data = house.clean)
summary(CRD.OverallQual)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## OverallQual    1 155.46  155.46    2931 <2e-16 ***
## Residuals   1458  77.34    0.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fishers LSD (Least Significant Difference)
library(agricolae)
ls.OveralQual = LSD.test(CRD.OverallQual, trt='OverallQual')
ls.OveralQual
## $statistics
##      MSerror   Df     Mean      CV
##   0.05304432 1458 12.02405 1.91544
## 
## $parameters
##         test p.ajusted      name.t ntr alpha
##   Fisher-LSD      none OverallQual  10  0.05
## 
## $means
##    log(SalePrice)       std   r      LCL      UCL      Min      Max
## 1        10.79880 0.3108790   2 10.47935 11.11826 10.57898 11.01863
## 2        10.82538 0.3060823   3 10.56455 11.08622 10.47195 11.00210
## 3        11.33747 0.3067162  20 11.23645 11.43850 10.54271 11.84654
## 4        11.55715 0.2802953 116 11.51520 11.59910 10.46024 12.45293
## 5        11.78066 0.2110016 397 11.75798 11.80333 10.93298 12.34126
## 6        11.96731 0.2298875 374 11.94395 11.99067 11.23849 12.53177
## 7        12.22177 0.2108982 319 12.19648 12.24707 11.32055 12.85832
## 8        12.49719 0.2313128 168 12.46234 12.53205 11.71178 13.19561
## 9        12.79327 0.2035213  43 12.72437 12.86216 12.38422 13.32393
## 10       12.92131 0.4034591  18 12.81482 13.02779 11.98293 13.53447
##         Q25      Q50      Q75
## 1  10.68889 10.79880 10.90872
## 2  10.73702 11.00210 11.00210
## 3  11.23162 11.36490 11.48665
## 4  11.38509 11.58989 11.74006
## 5  11.67844 11.79810 11.89819
## 6  11.84313 11.98293 12.10625
## 7  12.09776 12.20678 12.34908
## 8  12.36545 12.50525 12.63216
## 9  12.67288 12.75130 12.87310
## 10 12.76224 12.97697 13.06656
## 
## $comparison
## NULL
## 
## $groups
##    log(SalePrice) groups
## 10       12.92131      a
## 9        12.79327      b
## 8        12.49719      c
## 7        12.22177      d
## 6        11.96731      e
## 5        11.78066      f
## 4        11.55715      g
## 3        11.33747      h
## 2        10.82538      i
## 1        10.79880      i
## 
## attr(,"class")
## [1] "group"
# Scheffes Test
df = scheffe.test(CRD.OverallQual, "OverallQual", group = TRUE, console = TRUE)$groups
## 
## Study: CRD.OverallQual ~ "OverallQual"
## 
## Scheffe Test for log(SalePrice) 
## 
## Mean Square Error  : 0.05304432 
## 
## OverallQual,  means
## 
##    log.SalePrice.       std   r      Min      Max
## 1        10.79880 0.3108790   2 10.57898 11.01863
## 2        10.82538 0.3060823   3 10.47195 11.00210
## 3        11.33747 0.3067162  20 10.54271 11.84654
## 4        11.55715 0.2802953 116 10.46024 12.45293
## 5        11.78066 0.2110016 397 10.93298 12.34126
## 6        11.96731 0.2298875 374 11.23849 12.53177
## 7        12.22177 0.2108982 319 11.32055 12.85832
## 8        12.49719 0.2313128 168 11.71178 13.19561
## 9        12.79327 0.2035213  43 12.38422 13.32393
## 10       12.92131 0.4034591  18 11.98293 13.53447
## 
## Alpha: 0.05 ; DF Error: 1458 
## Critical Value of F: 1.886289 
## 
## Groups according to probability of means differences and alpha level( 0.05 )
## 
## Means with the same letter are not significantly different.
## 
##    log(SalePrice) groups
## 10       12.92131      a
## 9        12.79327      a
## 8        12.49719      b
## 7        12.22177      c
## 6        11.96731      d
## 5        11.78066      e
## 4        11.55715      f
## 3        11.33747     fg
## 2        10.82538      g
## 1        10.79880      g
log(SalePrice) groups
12.85736 a
12.49719 b
12.22177 c
11.96731 d
11.78066 e
11.52484 f
10.81475 g

Age of House

There are three different measures that all relate to the age of a home: YrSold, YearBuilt, andYearRemodAdd. We wanted to combine these into a single Age variable. It should be noted that if a house did not have a renovation then YearRemodAdd is equal to YearBuilt.

Our initial estimate for this value was: \[ Age = Y_{sold} - \frac{Y_{built} + Y_{renov}}{2} \]

Which, can be rewritten as: \[ Age = Y_{sold} - w \cdot Y_{built} + (1-w) \cdot Y_{renov} \]

We then wanted to optimize the value of \(w\) so we built a small linear regression model and ran a grid search of \(w\) values using 10-Fold Cross Validation. After a number of iterations we found an optimized value of \(w=0.58\) which minimized the RMSE of the 10-Fold Cross Validation.

results = data.frame() # create an empty dataframe to store the result of the grid search

age_seq = seq(0.57,0.59,by=0.001) # final sequence of values to search through
old_age = 30
for(w in age_seq) {
  # create a temporary data frame with the new formula for age
  temp.data = house.clean %>% mutate(Age = YrSold - (w * YearBuilt + (1-w) * YearRemodAdd),
                                 Old = Age >= old_age)

  # set up the cross validation
  set.seed(123)
  train.control = trainControl(method = "cv", number = 10)

  # Train the model
  model = train(log(SalePrice) ~ GrLivArea + Age:Old, data = temp.data,
               method = "lm",
               trControl = train.control)
  # save the results
  results = rbind(results, cbind(w=w, model$results))
}

# display the results of the grid search
results
##        w intercept      RMSE  Rsquared       MAE     RMSESD RsquaredSD
## 1  0.570      TRUE 0.2115273 0.7255814 0.1529785 0.02305752 0.05631114
## 2  0.571      TRUE 0.2115279 0.7255836 0.1529698 0.02306409 0.05632792
## 3  0.572      TRUE 0.2115270 0.7255957 0.1529396 0.02307069 0.05635035
## 4  0.573      TRUE 0.2115278 0.7255972 0.1529313 0.02307722 0.05636720
## 5  0.574      TRUE 0.2115288 0.7255982 0.1529232 0.02308373 0.05638409
## 6  0.575      TRUE 0.2115496 0.7255575 0.1529215 0.02310546 0.05640636
## 7  0.576      TRUE 0.2115509 0.7255576 0.1529140 0.02311191 0.05642315
## 8  0.577      TRUE 0.2115524 0.7255572 0.1529070 0.02311834 0.05643997
## 9  0.578      TRUE 0.2115541 0.7255565 0.1529003 0.02312476 0.05645682
## 10 0.579      TRUE 0.2115559 0.7255553 0.1528935 0.02313115 0.05647369
## 11 0.580      TRUE 0.2115732 0.7255053 0.1529106 0.02311000 0.05646077
## 12 0.581      TRUE 0.2115753 0.7255035 0.1529038 0.02311643 0.05647781
## 13 0.582      TRUE 0.2115798 0.7254900 0.1529158 0.02312298 0.05650494
## 14 0.583      TRUE 0.2115821 0.7254874 0.1529093 0.02312935 0.05652196
## 15 0.584      TRUE 0.2115671 0.7255435 0.1528804 0.02314697 0.05660164
## 16 0.585      TRUE 0.2115696 0.7255403 0.1528747 0.02315332 0.05661875
## 17 0.586      TRUE 0.2115724 0.7255367 0.1528693 0.02315964 0.05663589
## 18 0.587      TRUE 0.2116236 0.7253551 0.1529105 0.02310208 0.05665449
## 19 0.588      TRUE 0.2116265 0.7253508 0.1529045 0.02310846 0.05667181
## 20 0.589      TRUE 0.2116296 0.7253462 0.1528985 0.02311483 0.05668915
## 21 0.590      TRUE 0.2116329 0.7253412 0.1528925 0.02312117 0.05670652
##          MAESD
## 1  0.008881293
## 2  0.008883075
## 3  0.008865666
## 4  0.008867364
## 5  0.008868974
## 6  0.008900377
## 7  0.008902413
## 8  0.008904215
## 9  0.008905922
## 10 0.008907649
## 11 0.008873899
## 12 0.008875582
## 13 0.008860559
## 14 0.008862365
## 15 0.008849782
## 16 0.008852370
## 17 0.008855259
## 18 0.008826186
## 19 0.008828753
## 20 0.008831339
## 21 0.008833944
# and now we can find the optimal weight for YearBuilt
results %>% filter(RMSE == min(RMSE))
##       w intercept     RMSE  Rsquared       MAE     RMSESD RsquaredSD
## 1 0.572      TRUE 0.211527 0.7255957 0.1529396 0.02307069 0.05635035
##         MAESD
## 1 0.008865666

Age

w_age = 0.579
new_age = 5
old_age = 30
ggplot(house.clean, aes(x=(YrSold - (w_age * YearBuilt + (1-w_age) * YearRemodAdd)),
                        y=SalePrice), color=SalePrice) +
  geom_point() +
  geom_smooth(formula= y~x) +
  scale_y_continuous(labels = scales::dollar) +
  xlab("Age")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

YrSold - (w_age * YearBuilt + (1-w_age) * YearRemodAdd)

There were a number of variables that were connected and in an effort to reduce similar variables in our feature engineering process, we created the following formula to convert YearBuilt, YearRemodAdd, W_age. Combining these variables returned the linear regression as seen above.

Neighborhood

We know from everyday life that the neighborhood that you live in is related to the price of your house, but we suspected many neighborhoods to share similar characteristics. We analyzed this in some depth as follows.

ggplot(house.clean, aes(y = log(SalePrice), x = factor(Neighborhood), color = factor(Neighborhood))) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Neighbourhood") +
  ylab("Sale Price (log)")

ggplot(house.clean, aes(y = log(SalePrice), x = reorder(factor(Neighborhood), log(SalePrice), FUN = median), color = factor(Neighborhood))) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.2, height = 0.1), alpha = 0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "None") +
  xlab("Neighbourhood") +
  ylab("Sale Price (log)")

   log(SalePrice) Scheffe TeamHacksaw
NoRidge 12.68 a a
NridgHt 12.62 a a
StoneBr 12.59 ab a
Timber 12.36 abc b
Veenker 12.34 abcd b
Somerst 12.30 bcd b
ClearCr 12.24 bcd c
Crawfor 12.21 bcd c
Blmngtn 12.17 bcde d
CollgCr 12.16 cde d
Gilbert 12.16 cde d
NWAmes 12.13 cde d
SawyerW 12.09 cde e
Mitchel 11.93 def f
NAmes 11.87 defg f
NPkVill 11.87 defgh f
SWISU 11.84 defgh f
Blueste 11.83 defgh f
Sawyer 11.81 efgh f
Edwards 11.71 fgh g
OldTown 11.70 fgh g
BrkSide 11.68 fgh g
BrDale 11.55 fgh h
MeadowV 11.47 gh h
IDOTRR 11.45 h h

Alley

house.clean %>%
  group_by(Alley) %>%
  ggplot(aes(x = Alley, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the Alley, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “None” category and the distribution of SalePrice is not significantly different between categories.

BldgType

house.clean %>%
  group_by(BldgType) %>%
  ggplot(aes(x = BldgType, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the BldgType, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “1Fam” category and the distribution of SalePrice is not significantly different between categories.

BsmtExposure

house.clean %>%
  group_by(BsmtExposure) %>%
  ggplot(aes(x = BsmtExposure, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the BsmtExposure, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “No” category and the distribution of SalePrice is not significantly different between categories.

HeatingQC

house.clean %>%
  group_by(HeatingQC) %>%
  ggplot(aes(x = HeatingQC, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the HeatingQC, the chart shows that the variable doesn’t contribute to the selling price as the distribution of SalePrice is not significantly different between categories.

LotShape

house.clean %>%
  group_by(LotShape) %>%
  ggplot(aes(x = LotShape, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the LotShape, the chart shows that the variable doesn’t contribute to the selling price as the distribution of SalePrice is not significantly different between categories.

Pool Area

house.clean %>%
  filter(PoolArea > 0) %>%
  ggplot(aes(x = PoolArea, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the Pool Area, the chart shows that the variable doesn’t contribute to the selling price as there are very few houses sold with a pool in the dataset. Additionally, there is no relationship between Pool Area and SalePrice.

Fence

house.clean %>%
  group_by(Fence) %>%
  ggplot(aes(x = Fence, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the BsmtExposure, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “None” category and the distribution of SalePrice is not significantly different between categories.

MiscVal

house.clean %>%
  filter(MiscVal > 0, MiscVal < 2500) %>%
  ggplot(aes(x = MiscVal, y = SalePrice)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  geom_smooth(method = 'lm')

When investigating the MiscVal variable, there is no distinct correlation between sale price and the misc values people assign the property.

Electrical

house.clean %>%
  group_by(Electrical) %>%
  ggplot(aes(x = Electrical, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the Electrical variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “SBrkr” category and the distribution of SalePrice is not significantly different between categories.

ScreenPorch

house.clean %>%
  mutate(Screen = ScreenPorch > 0) %>%
  ggplot(aes(x = Screen, y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the ScreenPorch variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “FALSE” category and the distribution of SalePrice is not significantly different between categories.

BedroomAbvGr

house.clean %>%
  ggplot(aes(x = as.factor(BedroomAbvGr), y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the BedroomAbvGr variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “3” category and the distribution of SalePrice is not significantly different between categories.

HouseStyle

house.clean %>%
  ggplot(aes(x = reorder(HouseStyle, SalePrice), y = SalePrice)) +
  geom_boxplot() +
  geom_point(position = 'jitter', alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar)

When we reviewed the HouseStyle variable, the chart shows that the variable doesn’t contribute to the selling price as the majority of the data falls in the “1Story” and “2Story” categories and the distribution of SalePrice is not significantly different between categories.

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF

The variables of BsmtFinSF1, BsmtFinSF2 and BsmtUnfSF were removed because they are redundant. The information is also a part of the BasementFin which is the square footage of the basement.

BsmtFullBath, BsmtHalfBath

The variables of BsmtFullBath and BsmtHalfBath were merged together because they are redundant.

Exterior2nd

The variable of Exterior2nd was removed because this is redundant. The information is also a part of the Exterior.

GarageArea, GarageCars, GarageCond, GarageFinish, GarageQual, GarageType, GarageYrBlt

These variables were removed because they were redundant. All of these variables are merged into 1 variable.

Removed Variables

Condition1 (Removed! Hacksawed)

Condition1: Proximity to various conditions Artery Adjacent to arterial street Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200’ of North-South Railroad RRAn Adjacent to North-South Railroad PosN Near positive off-site feature–park, greenbelt, etc. PosA Adjacent to postive off-site feature RRNe Within 200’ of East-West Railroad RRAe Adjacent to East-West Railroad

#Condition1 
ggplot(house.clean, aes( x = factor(Condition1))) +
  geom_histogram(stat="count") 

Feature Engineering - Creating the Final DataFrame

Functional, BathBelow, BasementCond, GasFurnace, ExteriorCovering

p2 = ggplot(house.feature, aes(x=factor(Functional))) +
  geom_histogram(stat="count")
p3 = ggplot(house.feature, aes(x=factor(BathBelow))) +
  geom_histogram(stat="count")
p4 = ggplot(house.feature, aes(x=factor(BasementCond))) +
  geom_histogram(stat="count")
p5 = ggplot(house.feature, aes(x=factor(GasFurnace))) +
  geom_histogram(stat="count")
p6 = ggplot(house.feature, aes(x=factor(ExteriorCovering))) +
  geom_histogram(stat="count")
p7 = ggplot(house.clean, aes(x=factor(Condition1))) +
  geom_histogram(stat="count")
p8 = ggplot(house.clean, aes(x=factor(Condition2))) +
  geom_histogram(stat="count")
p9 = ggplot(house.clean, aes(x=factor(KitchenAbvGr))) +
  geom_histogram(stat="count")
p10 = ggplot(house.clean, aes(x=factor(LandContour))) +
  geom_histogram(stat="count")
p11 = ggplot(house.clean, aes(x=factor(LandSlope))) +
  geom_histogram(stat="count")
p12 = ggplot(house.clean, aes(x=factor(LotConfig))) +
  geom_histogram(stat="count")
p13 = ggplot(house.clean, aes(x=factor(PoolQC))) +
  geom_histogram(stat="count")
p14 = ggplot(house.clean, aes(x=factor(RoofMatl))) +
  geom_histogram(stat="count")
p15 = ggplot(house.clean, aes(x=factor(RoofStyle))) +
  geom_histogram(stat="count")
p16 = ggplot(house.clean, aes(x=factor(Street))) +
  geom_histogram(stat="count")
p17 = ggplot(house.clean, aes(x=factor(Utilities))) +
  geom_histogram(stat="count")

plot_grid(p2, p3, p4, p5, p6, p7, p8,  ncol=2)

plot_grid(p9, p10, p11, p12,p13,p14, p15, p16, p17, ncol=2)

These histograms show there isn’t as wide enough spread in the distribution so we decided to remove the following variables:

  • Functional
  • BathBelow
  • BasementCond
  • GasFurnace
  • ExteriorCovering
  • Condition1
  • Condition2
  • KitchenAbvGr
  • LandContour
  • LandSlope
  • LotConfig
  • PoolQC
  • RoofMatl
  • RoofStyle
  • Street
  • Utilities

When considering considering the Condition1 variable, there was not enough variance to consider as part of our model.

Sampling

Since our data is naturally stratified, and stratified sampling is always a more precise estimation than simple random sampling, we are using stratified sampling.

To further justify our use of stratified sampling we have created an ANOVA table which shows that the variation between our strata is greater than the variation within our strata.

summary(aov(log(house.feature$SalePrice) ~ house.feature$Neighborhood))
##                              Df Sum Sq Mean Sq F value Pr(>F)    
## house.feature$Neighborhood    5  131.6   26.32   377.7 <2e-16 ***
## Residuals                  1452  101.2    0.07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Analysis

Results

Conclusion

References

Cook, D. (2011, June). Ames, Iowa: Alternative to Boston Housing Data as an End of Semester Regression Project[Online]. Available at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (Retrieved January 20, 2020)